This notebook presents the first empirical study of the paper
*Forking paths in empirical studies*, available on SSRN.
The idea of the paper is to decompose any empirical analysis into a
series of steps, which are pompously called
*mappings* in the paper.
Each step matters and can have a substantial impact on the outcome of the
study.
Because of that, we argue that the researcher should, whenever possible,
keep track of all the possible modelling options and
present the distribution of outcomes (e.g., t-statistics or p-values)
across all configurations.
Below, some code chunks are shown (because they are possibly
insightful), while others are hidden (especially for plots).
All of the code can be accessed by clicking on the related
buttons.
First, we import the data & load the libraries.
The data, downloaded from Amit Goyal’s website,
is stored on a GitHub repo.
It was used in the follow-up paper *A
Comprehensive Look at the Empirical Performance of Equity Premium
Prediction II*.
Because we do not want to download the data multiple times, we do it
only once, for the three sheets (monthly, quarterly & yearly data).
library(tidyverse) # Data wrangling & plotting
library(readxl) # To read excel files
library(zoo) # For data imputation
library(DescTools) # For winsorization
library(sandwich) # HAC estimator
library(lmtest) # Statistical inference
library(furrr) # Parallel computing
library(viridis) # Color palette
library(patchwork) # Graph layout
library(xtable) # LaTeX exports
library(reshape2) # List management
library(stabledist) # For stable distributions
library(ptsuite) # For tail estimation
loadWorkbook_url <- function(sheet) { # Function that downloads the data from online file
url = "https://github.com/shokru/coqueret.github.io/blob/master/files/misc/PredictorData2021.xlsx?raw=true"
temp_file <- tempfile(fileext = ".xlsx")
download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
read_excel(temp_file, sheet = sheet)
}
data_month <- loadWorkbook_url(1) # Dataframe for monthly data
data_quarter <- loadWorkbook_url(2) # Dataframe for quarterly data
data_year <- loadWorkbook_url(3) # Dataframe for annual data
Next, we code the modules. They correspond to the
\(f_j\) mappings in the paper.
The chaining of mappings follows the scheme below:
Because of the strong analogy with neural networks, we approach the coding of mappings/modules similarly to the early implementation of the sequential Keras framework.
First module: decide which sheet to work with (i.e., horizon: monthly, quarterly or annually).
prob_sheet <- c(0,1,0)
module_sheet <- function(prob_sheet){
prob_month <- prob_sheet[1] # Prob. to pick sheet 1 (monthly)
prob_quarter <- prob_sheet[2] # Prob. to pick sheet 2 (quarterly)
prob_year <- prob_sheet[3] # Prob. to pick sheet 3 (yearly)
if((prob_month == 0) + (prob_quarter == 0) + (prob_year == 0) == 2){ # Deterministic choice
if(prob_month == 1){return(data_month)}
if(prob_quarter == 1){return(data_quarter)}
if(prob_year == 1){return(data_year)}
} else { # Random option (not used below)
p <- runif(1)
if(p <= prob_month){return(data_month)}
else if(p <= prob_month + prob_quarter){return(data_quarter)}
else {return(data_year)}
}
}
module_sheet(prob_sheet)[250:260, 1:11] |> # A snapshot
mutate(across(everything(), as.numeric)) |>
knitr::kable(font_size = 5)
| yyyyq | Index | D12 | E12 | b/m | tbl | AAA | BAA | lty | cay | ntis |
|---|---|---|---|---|---|---|---|---|---|---|
| 19332 | 10.91 | 0.4700 | 0.4250 | 0.8335032 | 0.0007 | 0.0446 | 0.0707 | 0.0306 | NaN | 0.0007609 |
| 19333 | 9.83 | 0.4550 | 0.4325 | 0.8679966 | 0.0004 | 0.0436 | 0.0727 | 0.0308 | NaN | -0.0004585 |
| 19334 | 10.10 | 0.4400 | 0.4400 | 0.8290260 | 0.0029 | 0.0450 | 0.0775 | 0.0336 | NaN | 0.0061958 |
| 19341 | 10.75 | 0.4425 | 0.4525 | 0.8025122 | 0.0024 | 0.0413 | 0.0626 | 0.0307 | NaN | 0.0096063 |
| 19342 | 9.81 | 0.4450 | 0.4650 | 0.8407311 | 0.0015 | 0.0393 | 0.0606 | 0.0289 | NaN | 0.0059828 |
| 19343 | 9.10 | 0.4475 | 0.4775 | 0.8703644 | 0.0021 | 0.0396 | 0.0657 | 0.0310 | NaN | 0.0194390 |
| 19344 | 9.50 | 0.4500 | 0.4900 | 0.7737409 | 0.0023 | 0.0381 | 0.0623 | 0.0293 | NaN | 0.0208127 |
| 19351 | 8.47 | 0.4500 | 0.7300 | 0.8007541 | 0.0015 | 0.0367 | 0.0620 | 0.0274 | NaN | 0.0120460 |
| 19352 | 10.23 | 0.4400 | 0.8100 | 0.6818182 | 0.0015 | 0.0361 | 0.0577 | 0.0270 | NaN | 0.0193440 |
| 19353 | 11.59 | 0.4400 | 0.7600 | 0.6117344 | 0.0020 | 0.0359 | 0.0553 | 0.0282 | NaN | 0.0080266 |
| 19354 | 13.43 | 0.4700 | 0.7600 | 0.5599112 | 0.0015 | 0.0344 | 0.0530 | 0.0276 | NaN | 0.0086888 |
After the first module, each mapping takes the data as its first argument, which makes it easy to chain instructions with pipes.
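To illustrate this design choice, here is a minimal sketch (not part of the study; the `toy_*` functions are hypothetical) of how data-first functions compose with the native pipe, just like the modules below:

```r
# Toy data-first functions: each takes a dataframe first & returns a dataframe
toy_clean  <- function(data) { na.omit(data) }                          # Drop NAs
toy_scale  <- function(data, col) {                                     # Scale one column
  data[[col]] <- data[[col]] / sd(data[[col]])
  data
}
toy_subset <- function(data, n) { head(data, n) }                       # Keep first n rows

out <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 6, 8)) |>
  toy_clean() |>
  toy_scale("y") |>
  toy_subset(2)
out                                                                     # 2 rows, no NAs
```

Any mapping with this signature can be added, removed, or reordered in the chain without touching the others.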
Second module: data cleaning.
NOTE: all columns must be converted to numbers beforehand.
module_cleaning <- function(data, prob_remove){ # Two inputs: data & choice
data <- data |> mutate(across(everything(), as.numeric)) # Convert everything to numbers
if(runif(1) <= prob_remove){ # If remove:
return(data |> na.omit()) # REMOVE!
} else
return(data |> mutate(across(everything(), na.locf, na.rm = F))) # Impute from previous non-NA (not very useful here)
}
Third module: winsorizing.
We compute some predictors in this module (from the raw
series): payout, dfr and
dfy.
The winsorization comes afterwards; its only input is the
winsorization level.
module_winsorization <- function(data, threshold){
data <- data |> mutate(payout = log(D12/E12),
dfy = BAA - AAA,
dfr = corpr - ltr)
data <- data |> mutate(across(2:ncol(data), Winsorize, probs = c(threshold, 1-threshold), na.rm = T))
return(data)
}
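For readers without DescTools, here is a base-R sketch of the clipping that `Winsorize` performs at level `threshold` (up to DescTools’ exact quantile conventions):

```r
# Base-R sketch of two-sided winsorization: values beyond the threshold and
# 1-threshold quantiles are pulled back to those quantiles.
winsorize_sketch <- function(v, threshold = 0.01) {
  q <- quantile(v, probs = c(threshold, 1 - threshold), na.rm = TRUE)
  pmin(pmax(v, q[1]), q[2])  # Clip below at q[1], above at q[2]
}
v <- c(-100, 1:98, 1000)     # Two extreme observations
w <- winsorize_sketch(v, 0.02)
range(w)                     # Extremes are pulled back inside the quantiles
```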
Fourth module: levels vs. differences.
This step decides whether the independent variables are kept in levels,
or replaced by their first differences.
differentiate <- function(v){v - lag(v)}
module_levels <- function(data, prob_levels){
if(runif(1) < prob_levels){return(data)}
else return(data |> mutate(across(3:ncol(data), differentiate)) |> na.omit())
}
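On a toy series, the `differentiate` mapping reduces to base R’s `diff` with a leading NA (which the module then removes via `na.omit`):

```r
# differentiate(v) = v - lag(v); equivalently, a leading NA followed by diff(v)
v      <- c(1, 3, 6, 10)
v_diff <- c(NA, diff(v))   # First observation is lost when differencing
v_diff                     # NA 2 3 4
```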
Fifth module: independent variable.
This stage sets the (unique) predictor. There are 6 in the study.
module_independent <- function(data, variable){
data |> dplyr::select(all_of(c("Index", variable, "Rfree")))
}
Sixth module: horizon.
This module determines the horizon of the dependent variable (the
return).
The horizon is expressed in periods; results are thus contingent on the
initial choice of the sheet (first module).
module_horizon <- function(data, horizon){
data[,1] <- lead(data[,1], horizon)/data[,1] - 1 - horizon * data[,3] # Excess returns on the Index (S&P 500)
data
}
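A quick sketch of the excess-return computation on a made-up price series (values are fabricated for illustration): the h-period simple return on the index, minus h times the per-period risk-free rate.

```r
library(dplyr)                         # For lead()
price <- c(100, 110, 121, 133.1)       # Toy index levels
rf    <- rep(0.01, 4)                  # Toy per-period risk-free rate
h     <- 2                             # Horizon in periods
excess <- lead(price, h) / price - 1 - h * rf
excess                                 # 0.19 0.19 NA NA (last h values are lost)
```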
Seventh module: starting point.
In our study, we allow for simple subsampling.
The idea is that predictability should not be (too)
time-dependent (though this may
be the case).
Thus, we allow for 3 sample-size options:
- the full sample;
- a sample that begins in the middle of the available data (second
half);
- a sample that stops at the middle of the data (first
half).
This module picks the starting point (either the original first point, or the middle point).
module_start <- function(data, start_quantile){
L <- nrow(data)
data[ceiling(L*start_quantile):L, ]
}
Eighth module: stopping point.
This module picks the stopping point (either the original last point, or the middle point).
module_stop <- function(data, stop_quantile){
L <- nrow(data)
data[1:ceiling(L*stop_quantile), ]
}
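The two subsampling modules can be checked on a toy dataframe: with a quantile of 0.5, `module_start` keeps the second half and `module_stop` keeps the first half (and a quantile of 0 or 1 keeps everything, since row index 0 is ignored by R’s subsetting).

```r
# Toy check of the row-slicing used by module_start & module_stop
df <- data.frame(t = 1:10)
second_half <- df[ceiling(nrow(df) * 0.5):nrow(df), , drop = FALSE]  # Rows 5..10
first_half  <- df[1:ceiling(nrow(df) * 0.5), , drop = FALSE]         # Rows 1..5
c(nrow(second_half), nrow(first_half))                               # 6 5
```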
Ninth module: estimator type.
This last step takes the data and performs estimations.
Two options are possible: the standard iid estimator, or the HAC
estimator from Newey-West.
The final function yields 10 outputs:
- the coefficient, the standard error, the t-statistic, the
p-value, and the skewness of errors, which are baseline
outputs;
- the two information criteria (AIC & BIC), for frequentist model
averaging;
- the sample size, the sum of squared residuals and the
variance of the dependent variable, for Bayesian model
averaging.
module_estimator <- function(data, estimator){
data <- data |> na.omit() # Removes NAs
y <- data |> pull(1) # Naming the dependent variable
x <- data |> pull(2) # Naming the independent variable
x <- x / sd(x) # Scaling the predictors
if(nrow(data) < 31){ # Test for sample size
t_stat <- NA # NA if too small (not enough obs.)
sd <- NA
coef <- NA
p_val <- NA
aic <- NA
bic <- NA
sum_sq_err <- NA
var_y <- NA
skew <- NA
} else {
if(estimator == "HAC"){
model <- lm(y ~ x) # Running the regression
coef <- summary(model)$coef[2,1] # OLS coefficient
sd <- coeftest(model, vcov. = NeweyWest(model))[2,2] # NW standard error
t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3] # OLS with Newey-West Cov matrix
p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "OLS"){
model <- lm(y ~ x) # Running the regression
coef <- summary(model)$coef[2,1] # OLS coef
sd <- summary(model)$coef[2,2] # OLS standard error
t_stat <- summary(model)$coef[2,3] # OLS t-statistic
p_val <- summary(model)$coef[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "AUG"){
model_x <- lm(x ~ lag(x)) # x-regression
rho <- model_x$coef[2]
nu <- lead(x) - rho * x # x-innovations
model <- lm(y ~ x + nu)
coef <- summary(model)$coef[2,1] # OLS coef
sd <- summary(model)$coef[2,2] # OLS standard error
t_stat <- summary(model)$coef[2,3] # OLS t-statistic
p_val <- summary(model)$coef[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "AUG-HAC"){
model_x <- lm(x ~ lag(x)) # x-regression
rho <- model_x$coef[2]
nu <- lead(x) - rho * x # x-innovations
model <- lm(y ~ x + nu)
coef <- summary(model)$coef[2,1] # OLS coef
sd <- coeftest(model, vcov. = NeweyWest(model))[2,2] # NW standard error
t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3] # OLS with Newey-West Cov matrix
p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
}
return(list(coef = coef, sd = sd, t_stat = t_stat, p_val = p_val,
sample_size = nrow(data), aic = aic, bic = bic,
sum_sq_err = sum_sq_err, var_y = var_y, skew = skew))
}
We show a simple test below.
It illustrates the natural chaining of the mappings, akin to
a forward pass in a neural network (without back-propagation!).
All modules are executed successively.
prob_sheet <- c(1,0,0) # Monthly data
prob_remove <- 1 # Remove NAs
threshold <- 0.01 # Threshold for winsorization
prob_levels <- 0 # Switch to differences
variable <- "D12" # Which independent variable
horizon <- 3 # Return horizon
start_quantile <- 0.5 # Starting point
stop_quantile <- 1 # Stopping point
estimator <- "AUG" # Which estimator
module_sheet(prob_sheet = prob_sheet) |> # Step 1: select data frequency
module_cleaning(prob_remove = prob_remove) |> # Step 2: clean the data
module_winsorization(threshold = threshold) |> # Step 3: manage outliers
module_levels(prob_levels = prob_levels) |> # Step 4: decide between levels versus differences
module_independent(variable = variable) |> # Step 5: choose predictor
module_horizon(horizon = horizon) |> # Step 6: pick return horizon (dependent variable)
module_start(start_quantile = start_quantile) |> # Step 7: starting point (for subsampling)
module_stop(stop_quantile = stop_quantile) |> # Step 8: ending point (for subsampling)
module_estimator(estimator = estimator) # Step 9: select estimator type
## $coef
## [1] 0.006294986
##
## $sd
## [1] 0.003970508
##
## $t_stat
## [1] 1.585436
##
## $p_val
## [1] 0.1136841
##
## $sample_size
## [1] 391
##
## $aic
## [1] -882.1459
##
## $bic
## [1] -866.2813
##
## $sum_sq_err
## [1] 2.329944
##
## $var_y
## [1] 0.006023265
##
## $skew
## [1] -0.4276608
Below, we code a function that takes all parameters
and runs the whole chain of operators/mappings.
Because we consider a uniform distribution over the sheets (horizons), we
recode the function slightly so that the corresponding input is a simple
string.
module_aggregator <- function(sheet = sheet,
prob_remove = prob_remove,
threshold = threshold,
prob_levels = prob_levels,
variable = variable,
horizon = horizon,
start_quantile = start_quantile,
stop_quantile = stop_quantile,
estimator = estimator){
if(sheet == "Monthly"){prob_sheet <- c(1,0,0)}
if(sheet == "Quarterly"){prob_sheet <- c(0,1,0)}
if(sheet == "Yearly"){prob_sheet <- c(0,0,1)}
module_sheet(prob_sheet = prob_sheet) |>
module_cleaning(prob_remove = prob_remove) |>
module_winsorization(threshold = threshold) |>
module_levels(prob_levels = prob_levels) |>
module_independent(variable = variable) |>
module_horizon(horizon = horizon) |>
module_start(start_quantile = start_quantile) |>
module_stop(stop_quantile = stop_quantile) |>
module_estimator(estimator = estimator)
}
Next, we span all the combinations we want to test. Implicitly, they are given the same (uniform) probability.
sheet <- c("Monthly", "Quarterly", "Yearly")
prob_remove <- c(0, 1)
threshold <- c(0, 0.01, 0.02, 0.03)
prob_levels <- c(0, 1)
variable <- c("payout", "b/m", "svar", "dfr", "dfy", "ntis")
horizon <- c(1, 3, 6, 12)
start_quantile <- c(0, 0.5)
stop_quantile <- c(0.5, 1)
estimator <- c("OLS", "HAC", "AUG", "AUG-HAC")
pars <- expand_grid(sheet, prob_remove, threshold, prob_levels, variable, horizon, start_quantile, stop_quantile, estimator)
pars <- pars |> filter(!(start_quantile == 0.5 & stop_quantile == 0.5))
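As a sanity check on the number of configurations (not part of the pipeline; integer stand-ins replace the actual option values), the full grid has 3×2×4×2×6×4×2×2×4 = 18432 rows, and dropping the (start, stop) = (0.5, 0.5) pairs keeps 3 of the 4 subsample combinations:

```r
# Base-R count of the configurations spanned by expand_grid + filter above
grid <- expand.grid(sheet = 1:3, prob_remove = 1:2, threshold = 1:4,
                    prob_levels = 1:2, variable = 1:6, horizon = 1:4,
                    start_quantile = c(0, 0.5), stop_quantile = c(0.5, 1),
                    estimator = 1:4)
grid <- grid[!(grid$start_quantile == 0.5 & grid$stop_quantile == 0.5), ]
nrow(grid)   # 13824
```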
We then launch the computation of the outputs across all 13824
combinations.
Note that with the functional programming abilities of
R, this requires remarkably little code.
To speed things up, we resort to parallelization over 4
cores.
plan(multisession, workers = 4) # Set the number of cores
output <- future_pmap_dfr(pars, module_aggregator) # Launch!
results <- bind_cols(pars, output) # Bind parameters to results
save(results, file = "results.RData")
Here, we adjust the distribution of t-statistics to account for the possible error-in-variables (EIV) bias, as in Proposition 2 of *Skill, Scale, and Value Creation in the Mutual Fund Industry*.
bias_terms <- function(x, m, s, t){ # We follow the notations of Proposition 2
# The first 5 values are the sample moments of the bivariate series
m_m <- mean(m, na.rm = T)
s_m <- sd(m, na.rm = T)
m_s <- mean(s, na.rm = T)
s_s <- sd(s, na.rm = T)
rho <- cor(m,s, use = "complete.obs")
# Next, the optimal h, along with bar(m1) and bar(m2)
h <- (3/4/s_m^5*(rho^4*s_s^4/12-(rho*s_s)^2+1)*exp(m_s+s_s^2*(1-rho^2/2)/2)*(length(m)/mean(t)))^(-1/3)
m1 <- (x-m_m)/s_m
m2 <- (x-m_m-rho*s_m*s_s)/s_m
bs1 <- h^2/2/s_m^2*(m1^2-1)/s_m*dnorm(m1)
bs2 <- (1/2/mean(t)*exp(m_s+s_s^2/2)/s_m^2*(m2^2-1))/s_m*dnorm(m2)
return(bs1 + bs2)
}
adjust_bias <- function(xx, m, s, t){
integrate(bias_terms,
-Inf, xx,
m = m, s = s, t = t)$value
}
adjust_cdf_v <- Vectorize(adjust_bias, vectorize.args='xx')
# Fine discretization
grid_01 <- seq(min(results$t_stat, na.rm = T)-0.1, max(results$t_stat, na.rm = T)+0.1, length.out = 2*nrow(results))
# Next step is long because of integration
Phi_adj <- adjust_cdf_v(grid_01,
m = results$t_stat,
s = results$sd,
t = results$sample_size) + ecdf(results$t_stat)(grid_01)
Phi_adj <- Phi_adj[Phi_adj < 1] |> sort()
# Create inverse function by swapping x & y
quant <- stepfun(Phi_adj, grid_01[1:(length(Phi_adj)+1)])
res_adj <- results
res_adj$t_stat <- quant(ecdf(results$t_stat)(results$t_stat))
res_adj <- res_adj |> mutate(adjusted = "Yes")
res_adj$coef <- NA
res_adj$sd <- NA
res_adj$p_val <- 2*(1-pt(abs(res_adj$t_stat), results$sample_size))
results <- results |> mutate(adjusted = "No")
results <- bind_rows(results, res_adj)
results # Have a look
## # A tibble: 27,648 × 20
## sheet prob_remove threshold prob_…¹ varia…² horizon start…³ stop_…⁴ estim…⁵
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Monthly 0 0 0 payout 1 0 0.5 OLS
## 2 Monthly 0 0 0 payout 1 0 0.5 HAC
## 3 Monthly 0 0 0 payout 1 0 0.5 AUG
## 4 Monthly 0 0 0 payout 1 0 0.5 AUG-HAC
## 5 Monthly 0 0 0 payout 1 0 1 OLS
## 6 Monthly 0 0 0 payout 1 0 1 HAC
## 7 Monthly 0 0 0 payout 1 0 1 AUG
## 8 Monthly 0 0 0 payout 1 0 1 AUG-HAC
## 9 Monthly 0 0 0 payout 1 0.5 1 OLS
## 10 Monthly 0 0 0 payout 1 0.5 1 HAC
## # … with 27,638 more rows, 11 more variables: coef <dbl>, sd <dbl>,
## # t_stat <dbl>, p_val <dbl>, sample_size <int>, aic <dbl>, bic <dbl>,
## # sum_sq_err <dbl>, var_y <dbl>, skew <dbl>, adjusted <chr>, and abbreviated
## # variable names ¹prob_levels, ²variable, ³start_quantile, ⁴stop_quantile,
## # ⁵estimator
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
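The inverse-CDF trick used above (building a quantile function by swapping x and y in `stepfun`) can be checked on a toy sample:

```r
# Minimal illustration of inverting an ECDF with stepfun
x  <- c(1, 2, 4, 8)              # Toy sample, already sorted
Fx <- ecdf(x)(x)                 # 0.25 0.50 0.75 1.00
quant <- stepfun(Fx[Fx < 1], x)  # Knots = CDF values, plateaus = sample values
quant(0.6)                       # Falls in [0.50, 0.75) -> returns 4
```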
Below, we plot the distribution of t-statistics and p-values.
results |> ggplot(aes(x = t_stat, fill = sheet)) +
geom_histogram(position = "dodge") + theme_minimal() +
scale_fill_viridis(discrete = T, begin = 0.1, end = 0.9, name = "Frequency", option = "plasma") +
theme(axis.title.y = element_blank(),
legend.position = c(0.8,0.8))
p1 <- results |>
mutate(type = p_val < 0.1) |>
ggplot(aes(x = p_val, fill = type)) + geom_histogram(bins = 40) +
theme_minimal() +
scale_fill_manual(values = c("#ADADAD", "#666666")) +
theme(axis.title = element_blank(),
text = element_text(size = 14),
legend.position = "none") +
annotate("rect", xmin = 0.19, xmax = 0.55, ymin = 500, ymax = 580, fill = "#FFFFFF") +
annotate("text", x = 0.39, y = 560, label = "bold(full~distribution)", parse = TRUE)
p2 <- results |> ggplot(aes(x = p_val)) + geom_histogram(bins = 20, aes(y=..density..), position = "identity") +
theme_minimal() + xlim(0,0.1) + ylim(0,17) +
theme(axis.title = element_blank(),
panel.background = element_rect(fill = "white", color = "white"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "#CCCCCC"),
plot.background = element_rect(fill = "white", color = "white")) +
annotate("rect", xmin = 0.03, xmax = 0.07, ymin = 11, ymax = 15, fill = "#FFFFFF") +
annotate("text", x = 0.05, y = 13, label = "bold(p-curve)", parse = TRUE) +
geom_density(color = "#38BDFC", size = 1.3)
p3 <- p1 + inset_element(p2, left = 0.45, bottom = 0.55, right = 1, top = 1)
p4 <- results |> ggplot(aes(x = p_val, color = variable)) + stat_ecdf(geom = "step") +
theme_minimal() +
theme(legend.position = c(0.8,0.3),
text = element_text(size = 14),
legend.background = element_rect(fill = "white", color = "white"),
legend.title = element_text(face = "bold"),
axis.title = element_blank())
p3 + p4 + plot_layout(widths = c(3, 2))
ggsave("p_val_dist.pdf", width = 8, height = 4)
Effects of dual operators: we show the histograms of t-statistics when we fix one choice for the binary mappings.
g_remove <- results |>
ggplot(aes(x = t_stat, fill = as.factor(prob_remove))) + theme_minimal() +
geom_histogram(position = "identity", alpha = 0.7) +
theme(legend.position = "top",
text = element_text(size = 14),
axis.title = element_blank(),
legend.title = element_text(face = "bold", size = 12)) +
scale_fill_manual(name = "Data cleaning", labels = c("imputation", "exclusion"), values = c("#66EEFF", "#555555")) +
guides(fill = guide_legend(nrow = 2, byrow = TRUE))
g_levels <- results |>
ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + theme_minimal() +
geom_histogram(position = "identity", alpha = 0.7) +
theme(legend.position = "top",
text = element_text(size = 14),
axis.title = element_blank(),
legend.title = element_text(face = "bold", size = 12)) +
guides(fill = guide_legend(nrow = 2, byrow = TRUE)) +
scale_fill_manual(name = "Variable engineering", labels = c("differences", "levels"), values = c("#FF66FF", "#555555"))
g_est <- results |>
ggplot(aes(x = t_stat, fill = as.factor(estimator))) + theme_minimal() +
geom_histogram(position = "identity", alpha = 0.5) +
theme(legend.position = "top",
text = element_text(size = 14),
axis.title = element_blank(),
legend.title = element_text(face = "bold", size = 12)) +
scale_fill_manual(name = "Estimator", labels = c("AUG", "OLS", "HAC", "AUG-HAC"),
values = c("#000000", "#222277", "#339933", "#DDDD33")) +
guides(fill = guide_legend(nrow = 2, byrow = TRUE))
g_remove + g_levels + g_est
ggsave("histos.pdf", width = 9, height = 4)
Next, we plot the boxplots of t-statistics: this gives a
broad view of their distribution.
We split the analysis according to variables and return horizon, and
also differentiate between data frequencies.
g1 <- results |>
ggplot(aes(x = variable, y = t_stat, color = sheet)) +
geom_boxplot() + theme_minimal() +
geom_hline(yintercept = 2, linetype = 2) + geom_hline(yintercept = -2, linetype = 2) +
scale_color_manual(values = c("#000000", "#666666", "#BBBBBB"), name = "frequency") +
theme(legend.position = c(0.5,0.85),
legend.title = element_text(face = "bold"),
text = element_text(size = 15),
axis.title.y = element_blank(),
axis.title.x = element_text(face = "bold"))
g2 <- results |>
ggplot(aes(x = as.factor(horizon), y = t_stat, color = sheet)) +
geom_boxplot() + theme_minimal() +
geom_hline(yintercept = 2, linetype = 2) + geom_hline(yintercept = -2, linetype = 2) +
scale_color_manual(values = c("#000000", "#666666", "#BBBBBB")) +
theme(legend.position = "none",
text = element_text(size = 15),
axis.title.y = element_blank(),
axis.title.x = element_text(face = "bold")) +
xlab("return horizon (periods)")
g1 + g2
ggsave("map.pdf", width = 8, height = 5)
The plot below shows the intervals when fixing exactly one mapping, while leaving all other mappings free. This generates many combinations, for each of the 28 possible fixed choices.
module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", "variable",
"horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")
results |>
mutate(across(all_of(module_names), as.character)) |>
pivot_longer(cols = all_of(module_names), names_to = "module", values_to = "value") |>
group_by(module, value) |>
summarise(min = min(t_stat, na.rm = T),
max = max(t_stat, na.rm = T)) |>
mutate(variable = paste(module, value)) |>
ggplot(aes(y = variable)) + geom_vline(xintercept = 0, linetype = 2) +
geom_errorbarh(aes(xmin = min, xmax = max, color = module)) +
theme_bw() + ylab("") + xlab("hacking interval (t-statistics)") +
theme(legend.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")) +
guides(color = guide_legend(reverse = TRUE)) +
scale_color_manual(values = c("#3D6499", "#FCA906", "#FC062F", "#9457CF", "#777777", "#000000", "#84D0B1", "#568751", "#7B6B4A", "#7DD5FE"))
ggsave("intervals.pdf", width = 8, height = 5)
size_full <- vector(mode = "list", length = length(module_names) - 1)
for(j in 1:(length(module_names) - 1)){
#print(j)
combz <- combn(module_names, j) # All possible combinations
for(k in 1:ncol(combz)){
temp <- results |>
group_by_at(combz[,k]) |>
summarise(min = min(t_stat, na.rm = T),
max = max(t_stat, na.rm = T)) |>
mutate(size = max - min)
size_full[[j]] <- c(size_full[[j]], temp |> pull(size))
}
}
First, some analysis.
Notably, we estimate the coefficients of \(y=a\cdot b^n\) (fitted
linearly in logs), where \(y\) is the average range of hacking
intervals and \(n\) is the
number of “free” mappings.
We also examine the growth rate between successive numbers of free mappings.
We see that even after 5-7 free mappings, each new mapping extends the
range of \(t\)-statistics by more than
30%.
data_combi <- reshape2::melt(size_full) |> filter(is.finite(value)) |> mutate(L1 = 10 - L1)
expanding_model <- lm(log(mean) ~ L1,
data = data_combi |>
group_by(L1) |> summarise(mean = mean(value, na.rm = T)))
summary(expanding_model)
##
## Call:
## lm(formula = log(mean) ~ L1, data = summarise(group_by(data_combi,
## L1), mean = mean(value, na.rm = T)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36218 -0.07807 0.03954 0.14357 0.17907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.23890 0.13829 -1.727 0.128
## L1 0.35178 0.02458 14.314 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 7 degrees of freedom
## Multiple R-squared: 0.967, Adjusted R-squared: 0.9622
## F-statistic: 204.9 on 1 and 7 DF, p-value: 1.932e-06
data_combi |>
group_by(L1) |>
summarise(ARI = mean(value, na.rm = T)) |>
mutate(rate = ARI / lag(ARI) - 1) |>
rename("J-k" = L1) |>
t() |>
round(3) |>
data.frame() |>
xtable(digits = 3)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Tue Jan 10 11:28:40 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrrrr}
## \hline
## & X1 & X2 & X3 & X4 & X5 & X6 & X7 & X8 & X9 \\
## \hline
## J-k & 1.000 & 2.000 & 3.000 & 4.000 & 5.000 & 6.000 & 7.000 & 8.000 & 9.000 \\
## ARI & 0.779 & 1.656 & 2.662 & 3.847 & 5.278 & 7.048 & 9.285 & 12.149 & 15.745 \\
## rate & & 1.124 & 0.608 & 0.445 & 0.372 & 0.335 & 0.317 & 0.308 & 0.296 \\
## \hline
## \end{tabular}
## \end{table}
Finally, the plot that summarizes the findings.
Note that the figures at the top show the numbers of combinations of
mappings over which the boxplots are drawn.
set.seed(42)
data_combi |>
mutate(size = abs(L1-3.5)) |>
ggplot(aes(x = as.factor(L1), y = value, color = L1)) + geom_boxplot() +
theme_bw() +
xlab("Degrees of freedom (number of free mappings)") +
ylab("range of hacking interval") +
scale_color_viridis_c(direction = -1) +
theme(axis.title = element_text(face = "bold"),
legend.position = "none") +
stat_summary(fun = mean, geom = "point", shape = 20, size = 6, fill = "black")+
geom_function(fun = function(x) exp(expanding_model$coefficients[1]) * exp(expanding_model$coefficients[2])^x) +
geom_label(data = data_combi |> group_by(L1) |> summarize(n = n(), max = 20.2),
aes(x = as.factor(L1), y = max, label = n)) +
annotate("rect", xmin = 0, xmax = 2.8, ymin = 5, ymax = 7, fill = "white", alpha = 0.5) +
annotate("text", x = 1.5, y = 6, label = "many small intervals") +
annotate("text", x = 8.5, y = 2, label = "few large intervals")
ggsave("combinations.pdf", width = 8, height = 5)
The chunk below performs paired Wilcoxon tests for all pairs of
predictors.
The upper triangle pertains to tests on t-statistics, while the
lower triangle relates to tests on p-values.
wilcox_tests <- function(X){ # This function runs the Wilcoxon tests
N <- ncol(X)
v <- c() # Initialize output with empty object
for(i in 1:(N-1)){ # Run across the first variable
for(j in (i+1):N){ # Run across the second variable
v <- c(v, wilcox.test(X[,i], X[,j], paired = TRUE)$p.value)
}
}
v
}
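As a quick sanity check on fabricated data (not the study’s predictors), the pairwise loop yields choose(N, 2) p-values for N columns:

```r
# Toy run of the pairwise Wilcoxon loop on 3 fake predictors
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)   # 100 obs. of 3 simulated series
v <- c()
for(i in 1:2){                      # First variable of the pair
  for(j in (i+1):3){                # Second variable of the pair
    v <- c(v, wilcox.test(X[,i], X[,j], paired = TRUE)$p.value)
  }
}
length(v)                           # choose(3, 2) = 3 p-values
```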
test_t_stat <- results |>
select(all_of(c(module_names, "t_stat"))) |>
pivot_wider(names_from = variable, values_from = t_stat) |>
select(all_of(unique(results$variable))) |>
as.matrix() |>
wilcox_tests()
test_p_val <- results |>
select(all_of(c(module_names, "p_val"))) |>
pivot_wider(names_from = variable, values_from = p_val) |>
select(all_of(unique(results$variable))) |>
as.matrix() |>
wilcox_tests()
variables <- unique(results$variable)
S <- diag(length(variables)) * NA
S[lower.tri(S, diag=F)] <- test_t_stat # Filled by column
S <- t(S) # Hence the transpose
S[lower.tri(S, diag=F)] <- test_p_val
colnames(S) <- unique(results$variable)
rownames(S) <- unique(results$variable)
Below, we test if the average of t-stats and p-values linked to one predictor is different from that of all other predictors.
t_stat_test <- 0
p_val_test <- 0
for(j in 1:6){
t_stat_test[j] <- t.test(
results |> filter(variable %in% variables[j]) |> pull(t_stat), # values of predictor j
results |> filter(!(variable %in% variables[j])) |> pull(t_stat) # all other values
)$p.value
p_val_test[j] <- t.test(
results |> filter(variable %in% variables[j]) |> pull(p_val),
results |> filter(!(variable %in% variables[j])) |> pull(p_val)
)$p.value
}
S <- cbind(S, t_stat_test)
S <- rbind(S, c(p_val_test,NA))
round(S,3) |> xtable(digits = 3)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Tue Jan 10 11:28:48 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrr}
## \hline
## & payout & b/m & svar & dfr & dfy & ntis & t\_stat\_test \\
## \hline
## payout & & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.011 \\
## b/m & 0.002 & & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
## svar & 0.293 & 0.000 & & 0.000 & 0.000 & 0.000 & 0.000 \\
## dfr & 0.000 & 0.000 & 0.000 & & 0.000 & 0.000 & 0.000 \\
## dfy & 0.577 & 0.000 & 0.491 & 0.000 & & 0.000 & 0.000 \\
## ntis & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & & 0.000 \\
## & 0.036 & 0.000 & 0.000 & 0.000 & 0.012 & 0.000 & \\
## \hline
## \end{tabular}
## \end{table}
Average p-values for two predictors.
results |> filter(variable == "payout") |> pull(p_val) |> mean(na.rm = T)
## [1] 0.4615937
results |> filter(variable == "svar") |> pull(p_val) |> mean(na.rm = T)
## [1] 0.4726539
b/m and ntis are the two most promising variables. Let’s have a closer look at the corresponding \(t\)-statistics. For future use, we estimate their tail behavior beforehand.
t_stat_bm <- results |>
na.omit() |>
filter(variable %in% c("b/m"), prob_levels == 1) |>
pull(t_stat)
generate_all_estimates(t_stat_bm[t_stat_bm > 1])
## Method.of.Estimation Shape.Parameter Scale.Parameter
## 1 Maximum Likelihood Estimate 1.1398302 1.012727
## 2 Least Squares 1.5502124 1.012727
## 3 Method of Moments 1.5321005 1.012727
## 4 Percentiles Method 1.2240918 1.012727
## 5 Modified Percentiles Method 1.3901625 1.012727
## 6 Geometric Percentiles Method 0.9376195 1.012727
## 7 Weighted Least Squares 1.1300860 1.012727
generate_all_estimates(t_stat_bm[t_stat_bm > 1]) |>
pull(Shape.Parameter) |>
mean()
## [1] 1.272015
t_stat_ntis <- results |>
na.omit() |>
filter(variable %in% c("ntis"), prob_levels == 1) |>
pull(t_stat)
generate_all_estimates(-t_stat_ntis[t_stat_ntis < (-1)])
## Method.of.Estimation Shape.Parameter Scale.Parameter
## 1 Maximum Likelihood Estimate 1.249048 1.000833
## 2 Least Squares 1.866523 1.000833
## 3 Method of Moments 1.648273 1.000833
## 4 Percentiles Method 1.597082 1.000833
## 5 Modified Percentiles Method 1.628415 1.000833
## 6 Geometric Percentiles Method 1.165006 1.000833
## 7 Weighted Least Squares 1.239427 1.000833
generate_all_estimates(-t_stat_ntis[t_stat_ntis < (-1)]) |>
pull(Shape.Parameter) |>
mean()
## [1] 1.484825
We discriminate between levels and differences.
g1 <- results |>
filter(variable %in% c("b/m")) |>
mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
ggplot(aes(x = t_stat, fill = as.factor(estimator))) + geom_histogram(aes(y = ..density..), position = "dodge") +
facet_wrap(vars(prob_levels)) +
theme_bw() +
theme(axis.title.x = element_text(face = "bold"),
legend.position = "bottom",
title = element_text(face = "bold"),
axis.title.y = element_blank()) +
scale_fill_viridis_d(option = "magma", end = 0.85, name = "estimator") +
geom_label(aes(x = 7.5, y = 0.2, label = "b/m"), fill = "white", size = 6) +
#scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
#annotate("text", x = 7.5, y = 0.1, label = "strong\n effects") +
xlab("t-statistic")
g2 <- results |>
filter(variable %in% c("ntis")) |>
mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
ggplot(aes(x = t_stat, fill = as.factor(estimator))) + geom_histogram(aes(y = ..density..), position = "dodge") +
facet_wrap(vars(prob_levels)) +
theme_bw() +
theme(axis.title.x = element_text(face = "bold"),
legend.position = "none",
title = element_text(face = "bold"),
axis.title.y = element_blank()) +
scale_fill_viridis_d(option = "magma", end = 0.85, name = "estimator") +
#scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
geom_label(aes(x = -7.5, y = 0.2, label = "ntis"), fill = "white", size = 6) +
xlab("t-statistic")
g1 / g2
ggsave("stable.pdf", width = 8, height = 5)
When several estimates have been obtained to quantify one effect, it is natural to seek to aggregate them.
There are many ways to proceed, but we choose a very common form: \[\hat{b}_* =\sum_{j=1}^Jw_j\hat{b}_j\] with \[w_j= \frac{e^{-\Delta_j/2}}{\sum_{k=1}^Je^{-\Delta_k/2}}, \quad \Delta_j = AIC_j-\underset{j}{\min}AIC_j\]
For the estimation of the variance of the aggregate estimator, we follow Equation (1) in Burnham & Anderson (2004): \[\hat{\sigma}^2_*=\left(\sum_{j=1}^J w_j\sqrt{\hat{\sigma}_j^2 + (\hat{b}_*-\hat{b}_j)^2} \right)^2.\]
Below, the confidence intervals are computed as \([\hat{b}_*-1.96\hat{\sigma}_*/\sqrt{T_*},\hat{b}_*+1.96\hat{\sigma}_*/\sqrt{T_*}]\), where \(T_*=\sum_{j=1}^Jw_jT_j\) is the weighted average sample size.
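Before running this on the full grid of specifications, the weighting scheme can be verified on toy numbers (the AIC values, estimates and standard deviations below are purely illustrative, not taken from the study):

```r
# Hypothetical estimates from J = 3 specifications (illustrative values only)
aic  <- c(100.0, 101.2, 104.7)   # AIC of each model
bhat <- c(0.30, 0.25, 0.10)      # Estimated coefficients
sdev <- c(0.10, 0.12, 0.20)      # Their standard deviations

D <- aic - min(aic)                   # AIC differences Delta_j
w <- exp(-D / 2) / sum(exp(-D / 2))   # Akaike weights w_j (sum to one)

b_star <- sum(w * bhat)               # Aggregate estimate b_*
s_star <- sum(w * sqrt(sdev^2 + (b_star - bhat)^2))  # Burnham-Anderson std dev
c(b_star = b_star, s_star = s_star)
```

The best model (lowest AIC) gets the largest weight, but models within a few AIC units still contribute; the term \((\hat{b}_*-\hat{b}_j)^2\) inflates the aggregate variance when the estimates disagree.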
m_avg <- results |>
filter(is.finite(coef)) |>
group_by(variable, prob_levels, sheet) |>
mutate(D = (aic - min(aic)),
weight = if_else(is.na(D), 0, exp(-D/2) /sum(exp(-D/2), na.rm = T)),
b = sum(weight * coef, na.rm = T),
n = sum(weight * sample_size, na.rm = T)) |>
summarise(s = sum(weight * sqrt(sd^2+(b-coef)^2), na.rm = T),
b = mean(b),
n = mean(n)) |>
mutate(t = b / s * sqrt(n))
g_diff <- m_avg |>
filter(prob_levels == 0) |>
ggplot(aes(x = reorder(variable, b), y = b, color = sheet)) +
geom_point() + geom_hline(yintercept = 0, linetype = 2) +
geom_errorbar(aes(ymin = b - 1.96 * s / sqrt(n), ymax = b + 1.96 * s / sqrt(n)), width = 0.5) +
theme_minimal() + ylab("coefficient (confidence interval)") +
theme(legend.position = c(0.18,0.8),
text = element_text(size = 12),
title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
legend.background = element_rect(fill = "white", color = "white"),
axis.title.x = element_blank()) +
labs(color="Frequency",
subtitle = "Difference predictors")
g_levels <- m_avg |>
filter(prob_levels == 1) |>
ggplot(aes(x = reorder(variable, b), y = b, color = sheet)) +
geom_hline(yintercept = 0, linetype = 2) + geom_point() +
geom_errorbar(aes(ymin = b - 1.96 * s / sqrt(n), ymax = b + 1.96 * s / sqrt(n)), width = 0.5) +
theme_minimal() + ylab("coefficient (confidence interval)") +
theme(legend.position = c(0.18,0.8),
text = element_text(size = 12),
title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
legend.background = element_rect(fill = "white", color = "white"),
axis.title.x = element_blank()) +
labs(color="Frequency",
subtitle = "Level predictors")
g_levels + g_diff
ggsave("model_averaging.pdf", width = 8, height = 4)
The quantity of interest is \(b\), with posterior probability given the data \(D\) equal to \[P[b | D]= \sum_{m=1}^MP[b|M_m,D]P[M_m|D],\] where \(\mathbb{M}=\{M_m,m=1,\dots,M\}\) is the set of models under consideration. One model corresponds to one complete path. Notably, the above equation translates to the following conditional average and variance: \[\begin{align} \mathbb{E}[b|D]&=\sum_{m=1}^M\hat{b}_mP[M_m|D] \label{eq:mean} \\ \mathbb{V}[b|D]&=\sum_{m=1}^M \left(\mathbb{V}(b|M_m,D) + \hat{b}^2_m \right)P[M_m|D] - (\mathbb{E}[b|D])^2 \end{align}\] where \(\hat{b}_m\) is the estimated effect in model \(m\). The posterior model probabilities are given by \[\begin{equation} P[M_m|D] = \left(\sum_{j=1}^M \frac{P[M_j]}{P[M_m]}\frac{l_D(M_j)}{l_D(M_m)} \right)^{-1}, \end{equation}\] with \(l_D(M_j)\) being the marginal likelihood of model \(j\). Because we are agnostic with respect to which model is more realistic, we set the prior odds \(\frac{P[M_j]}{P[M_m]}\) equal to one. In this case, the posterior probabilities are simply proportional to the likelihoods. The remaining Bayes factor is by far the most complex term: \[\begin{equation} \frac{l_D(M_j)}{l_D(M_m)}=\left(\frac{n_j}{n_j+1}\right)^{\frac{k_j}{2}}\left(\frac{n_m+1}{n_m}\right)^{\frac{k_m}{2}}\frac{\left(\frac{s_m+n_mv_m}{n_m+1} \right)^{(n_m-1)/2}}{\left(\frac{s_j+n_jv_j}{n_j+1} \right)^{(n_j-1)/2}}, \label{eq:odds} \end{equation}\] where \(n_j\) is the number of observations used in model \(j\) and \(k_j\) is the number of predictors in this model, omitting the constant (\(k_j=1\) in our case). Moreover, \(n_jv_j\) is the sample variance of the dependent variable in model \(j\). Finally, \(s_j\) is the sum of squared residuals under model \(j\).
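To make the mechanics concrete, here is a minimal sketch of the posterior model probabilities for two hypothetical models under equal prior odds. The sample sizes, sums of squared residuals and variances are made up; the likelihood expression mirrors the one used in the chunk below, with \(k_j=1\) for all models:

```r
n  <- c(120, 120)    # Number of observations in each model (illustrative)
s  <- c(4.1, 4.4)    # Sums of squared residuals s_j (illustrative)
nv <- c(0.05, 0.05)  # n_j * v_j: sample variance of the dependent variable
# Marginal likelihoods, up to a constant common to all models (k_j = 1):
lik  <- sqrt(n / (n + 1)) / ((s + nv) / (n + 1))^((n - 1) / 2)
post <- lik / sum(lik)   # Posterior probabilities P[M_m | D]
post                     # The model with the smaller SSR dominates
```

Even a modest difference in residual sums of squares is raised to the power \((n-1)/2\), so posterior mass concentrates quickly on the better-fitting model; for large \(n\), working with log-likelihoods avoids numerical overflow.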
level.labs <- c("Difference predictor", "Level predictor")
names(level.labs) <- c("0", "1")
results |>
na.omit() |> # Likelihood below
mutate(lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sample_size+1))^((sample_size-1)/2)) |>
filter(sheet == "Yearly") |>
group_by(variable, prob_levels) |>
summarize(p = lik / sum(lik),
b = sum(coef * p),
n = sum(sample_size * p),
sdev = sqrt(sum((sd^2+coef^2)*p)-b^2)) |>
ggplot(aes(x = reorder(variable, b))) +
geom_hline(yintercept = 0, linetype = 2) +
theme_minimal() + facet_wrap(vars(prob_levels), labeller = labeller(prob_levels = level.labs)) +
geom_errorbar(aes(ymin = b - 3* sdev/sqrt(n), ymax = b + 3 * sdev/sqrt(n) , width = 0.3), color = "#AAAAAA" ) +
geom_errorbar(aes(ymin = b - 2* sdev/sqrt(n), ymax = b + 2* sdev/sqrt(n) , width = 0.5), color = "#555555") +
geom_errorbar(aes(ymin = b , ymax = b, width = 0.8), color = "black" ) +
theme(axis.title.x = element_blank(),
text = element_text(size = 14),
strip.text = element_text(size = 14, face = "bold")) + ylab("coefficient") +
geom_point(data = m_avg |> filter(sheet == "Yearly"), aes(y = b), color = "red")
#ggsave("Bayes_avg.pdf", width = 8, height = 4)
Below, we plot the estimated rejection rates across all tests, both for each predictor and for the full sample (ALL, in black). The Uniform benchmark shows the rates that would obtain if p-values were uniformly distributed, as under the null.
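As a reminder of why uniform p-values are the relevant yardstick, a quick simulation (not part of the study) shows that under the null, the rejection rate at each threshold matches the threshold itself:

```r
set.seed(123)
p_val <- runif(1e5)   # p-values under the null are uniform on [0, 1]
c(RR_005 = mean(p_val < 0.005),   # Each rate should be close to its threshold
  RR_010 = mean(p_val < 0.01),
  RR_050 = mean(p_val < 0.05),
  RR_100 = mean(p_val < 0.1))
```

Rejection rates well above this benchmark across specifications are thus suggestive of genuine (or at least robust) effects.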
agg <- results |>
summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
RR_010 = mean(p_val < 0.01, na.rm = T),
RR_050 = mean(p_val < 0.05, na.rm = T),
RR_100 = mean(p_val < 0.1, na.rm = T)) |>
bind_rows(data.frame(RR_005 = 0.005, RR_010 = 0.01, RR_050 = 0.05, RR_100 = 0.1)) |>
bind_cols(variable = c("ALL", "Uniform")) |>
pivot_longer(-variable, names_to = "Threshold", values_to = "RR")
p1 <- results |>
group_by(variable) |>
summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
RR_010 = mean(p_val < 0.01, na.rm = T),
RR_050 = mean(p_val < 0.05, na.rm = T),
RR_100 = mean(p_val < 0.1, na.rm = T)) |>
pivot_longer(-variable, names_to = "Threshold", values_to = "RR") |>
bind_rows(agg) |>
ggplot(aes(x = Threshold, y = RR, color = variable, group = variable)) +
geom_line() + geom_point(size = 2.5) + theme_minimal() +
scale_color_manual(values = c("#000000", "#888888", "#CCCCCC", "#581845", "#C70099", "#FFC300", "#4C9129", "#1199FF")) +
scale_x_discrete(labels = c(
"RR_005" = "0.5%",
"RR_010" = "1%",
"RR_050" = "5%",
"RR_100" = "10%"
)) +
theme(legend.position = c(0.3,0.85),
text = element_text(size = 16),
legend.title = element_text(face = "bold"),
legend.background = element_rect(fill = "white", color = "white"),
axis.title.y = element_blank()) +
guides(color = guide_legend(ncol = 2))
p2 <- results |>
filter(prob_levels == 1) |>
group_by(variable) |>
summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
RR_010 = mean(p_val < 0.01, na.rm = T),
RR_050 = mean(p_val < 0.05, na.rm = T),
RR_100 = mean(p_val < 0.1, na.rm = T)) |>
pivot_longer(-variable, names_to = "Threshold", values_to = "RR") |>
ggplot(aes(x = Threshold, y = RR, color = variable, group = variable)) +
geom_line() + geom_point(size = 2.5) + theme_minimal() +
scale_color_manual(values = c("#888888", "#CCCCCC", "#581845", "#C70099", "#FFC300", "#4C9129", "#1199FF")) +
scale_x_discrete(labels = c(
"RR_005" = "0.5%",
"RR_010" = "1%",
"RR_050" = "5%",
"RR_100" = "10%"
)) +
theme(legend.position = c(0.25,0.85),
text = element_text(size = 16),
legend.title = element_text(face = "bold"),
legend.background = element_rect(fill = "white", color = "white"),
axis.title.y = element_blank()) +
guides(color = guide_legend(ncol = 2))
p1 + p2
#ggsave("RR.pdf", width = 9, height = 4.5)
Next, we focus on ntis & b/m, with a graph taken from a previous version of the paper.
results |>
filter(variable %in% c("b/m", "ntis")) |>
ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_vline(xintercept = 0, linetype = 2) +
geom_histogram(alpha = 0.7, position = "dodge") +
facet_wrap(vars(variable)) + theme_minimal() +
xlab("t-statistic") +
theme(legend.position = c(0.5,0.8),
legend.title = element_text(face = "bold"),
legend.background = element_rect(fill = "white", color = "white")) +
scale_fill_manual(name = "predictor", labels = c("Differences", "Levels"), values = c("#4DA2F1", "#F1A44D"))
Finally, we illustrate the differences with a jitter plot.
set.seed(42) # Random seed - for reproducibility
results |>
filter(variable %in% c("b/m", "ntis")) |>
mutate(horizon = as.factor(horizon),
horizon = recode_factor(horizon,
`1` = "1 period",
`3` = "3 periods",
`6` = "6 periods",
`12` = "12 periods")) |>
ggplot(aes(x = sheet, y = t_stat, color = as.factor(prob_levels))) +
geom_hline(yintercept = 0, linetype = 3, color = "grey") +
geom_hline(yintercept = -2, linetype = 2) +
geom_hline(yintercept = 2, linetype = 2) +
scale_x_discrete() +
geom_jitter(alpha = 0.7) +
facet_grid(vars(variable), vars(horizon), scales = "free") + theme_minimal() +
ylab("t-statistic") +
theme(legend.title = element_text(face = "bold"),
axis.title.x = element_blank(),
strip.text = element_text(face = "bold"),
legend.position = c(0.15,0.12),
text = element_text(size = 13),
legend.background = element_rect(fill = "white", color = "white")) +
scale_color_manual(name = "predictor", labels = c("Differences", "Levels"), values = c("#4DA2F1", "#F1A44D")) #+
#geom_label(data = table, aes(label = t), show_guide = FALSE)
#ggsave("focus.pdf", width = 9, height = 4)
We then take a look at the tails of the t-statistic distributions; stable densities are superimposed in red.
g1 <- results |>
filter(variable %in% c("b/m")) |>
mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") +
theme_bw() +
theme(axis.title.x = element_text(face = "bold"),
legend.position = "none",
title = element_text(face = "bold"),
axis.title.y = element_blank()) +
scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
annotate("text", x = 7.5, y = 0.1, label = "strong\n effects") +
xlab("t-statistic") + ggtitle("b/m") +
geom_function(fun = dstable, colour = "red",
args = list(alpha = 1.24, beta = 1, gamma = 1,
delta = results |> filter(variable %in% c("b/m"), prob_levels == 1) |> pull(t_stat) |> mean(na.rm=T)))
g2 <- results |>
filter(variable %in% c("ntis")) |>
mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") +
theme_bw() +
theme(legend.position = c(0.3,0.8),
axis.title.x = element_text(face = "bold"),
title = element_text(face = "bold"),
axis.title.y = element_blank()) +
scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
annotate("text", x = -7.5, y = 0.1, label = "strong\n effects") +
xlab("t-statistic") + ggtitle("ntis") +
geom_function(fun = dstable, colour = "red",
args = list(alpha = 1.45, beta = -1, gamma = 1,
delta = results |> filter(variable %in% c("ntis"), prob_levels == 1) |> pull(t_stat) |> mean(na.rm=T)))
g1 + g2
Below, we estimate the tail exponent of the errors in simple monthly predictive regressions. Left and right tails are treated separately.
x <- lag(as.numeric(data_month$`b/m`))
y <- data_month$Index/lag(data_month$Index)-1
data_bm <- data.frame(x = x, y = y) |> na.omit()
res <- lm(data_bm$y ~ data_bm$x)$residuals # Residuals from model with b/m predictor
table <- generate_all_estimates(-res[res < 0])[,1:2]
colnames(table) <- c("Method", "b/m, left")
table <- table |>
bind_cols(`b/m, right` = generate_all_estimates(res[res > 0])$Shape.Parameter)
x <- lag(as.numeric(data_month$`ntis`))
y <- data_month$Index/lag(data_month$Index)-1
res <- lm(y ~ x)$residuals # Residuals from model with ntis predictor
table <- table |>
bind_cols(`ntis, left` = generate_all_estimates(-res[res < 0])$Shape.Parameter) |>
bind_cols(`ntis, right` = generate_all_estimates(res[res > 0])$Shape.Parameter)
#xtable(table, digits = 3)
table
## Method b/m, left b/m, right ntis, left ntis, right
## 1 Maximum Likelihood Estimate 0.2428556 0.2187568 0.1161515 0.1456144
## 2 Least Squares 0.7035374 0.7174895 0.5567472 0.6551829
## 3 Method of Moments 1.0100470 1.0066174 1.0001024 1.0006469
## 4 Percentiles Method 0.7717784 0.9062372 0.7549434 0.8492861
## 5 Modified Percentiles Method 1.0634475 1.4983419 1.0923893 1.3901179
## 6 Geometric Percentiles Method 0.4934899 0.5580351 0.4567327 0.5277124
## 7 Weighted Least Squares 0.2410709 0.2173651 0.1152814 0.1446107
Lastly, another look at the results: histograms of t-statistics for each predictor, in levels versus differences.
results |>
ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) +
geom_histogram(position = "dodge") + theme_bw() +
facet_wrap(vars(variable), scales = "free") +
xlab("t-statistic") +
theme(legend.position = c(0.44,0.85),
legend.title = element_text(face = "bold"),
axis.title.y = element_blank()) +
scale_fill_manual(values = c("#1155DD", "#FFBB11"), name = "regressor", labels = c("differences", "levels"))
ggsave("hist_vars.pdf", width = 8, height = 5)